Label propagation has proven to be a fast method for detecting communities in large complex
networks. Recent developments have also improved the accuracy of the approach; however, a general
algorithm is still an open issue. We present an advanced label propagation algorithm that combines
two unique strategies of community formation, namely, defensive preservation and offensive
expansion of communities. The two strategies are combined in a hierarchical manner, to recursively
extract the core of the network and to identify whisker communities. The algorithm was evaluated
on two classes of benchmark networks with planted partition and on almost 25 real-world networks
ranging from networks with tens of nodes to networks with several tens of millions of edges. It is
shown to be comparable to the current state-of-the-art community detection algorithms and superior
to all previous label propagation algorithms, with comparable time complexity. In particular,
analysis on real-world networks has shown that the algorithm has almost linear complexity,
$\mathcal{O}(m^{1.19})$, and scales even better than the basic label propagation algorithm ($m$ is
the number of edges in the network).

PACS numbers: 89.75.Fb, 89.75.Hc, 87.23.Ge, 89.20.Hh 



I. INTRODUCTION 

Large real-world networks can comprise local structural modules (communities), i.e., groups
of nodes that are densely connected within and only loosely connected with the rest of the
network. Communities are believed to play important roles in different real-world systems
(e.g., they may correspond to functional modules in metabolic networks [1]); moreover, they
also provide valuable insight into the structure and function of large complex networks [1-3].
Nevertheless, real-world networks can reveal even more complex modules than communities [4, 5].

Over the last decade the research community has shown considerable interest in detecting
communities in real-world networks. Since the seminal paper of Girvan and Newman [6], a vast
number of approaches have been presented in the literature, in particular, approaches
optimizing modularity $Q$ (significance of communities due to a selected null model [7]) [8-12],
graph partitioning [13, 14] and spectral algorithms [15, 16], statistical methods [3],
algorithms based on dynamic processes [18-20], overlapping, hierarchical and multiresolution
methods [1, 10, 20], and others [21] (for an excellent survey see [22]).

The size of large real-world networks has forced the research community to develop scalable
approaches that can be applied to networks with several millions of nodes and billions of
edges. A promising effort was made by Raghavan et al. [18], who employed simple label
propagation to find significant communities in large real-world networks. Tibély and
Kertész [23] have shown that label propagation is in fact equivalent to a large
zero-temperature kinetic Potts model, while Barber and Clark [11] have further refined the
approach into a modularity optimization algorithm. Just recently, Liu and Murata [12] combined
the modularity optimization version of the algorithm with a multistep greedy agglomeration [24]
and derived an extremely accurate community detection algorithm.

FIG. 1. (Color online) Results of the diffusion and propagation algorithm applied to the
network of autonomous systems of the Internet [25]. The figure shows two community networks,
where the largest nodes correspond to densely connected modules of almost $10^4$ nodes in the
original network. Network cores, extracted by the algorithm, are colored red (dark gray) and
whisker communities are represented with transparent nodes. The results show that the algorithm
can detect communities on various levels of resolution: average community sizes are 16.38 and
588.79 nodes, respectively (with $Q$ equal to 0.475 and 0.582, respectively).

Leung et al. [19] have investigated label propagation on large web networks, mainly focusing
on scalability issues, and have shown that the performance can be significantly improved with
label hop attenuation and by applying node preference (i.e., node propagation strength). We
build on their work by developing two unique strategies of community formation, namely,
defensive preservation of communities, where preference is given to the nodes in the core of
each community, and offensive expansion of communities, where preference is given to the
border nodes of each community. Cores and borders are estimated using random walks, which
formulate the diffusion over the network.

Furthermore, we propose an advanced label propagation algorithm, the diffusion and
propagation algorithm, that combines the two strategies in a hierarchical manner: the
algorithm first extracts the core of the network and identifies whisker communities [26]
(appendix A), and then recurses on the network core (Fig. 1). The performance of the
algorithm has been analyzed on two classes of benchmark networks with planted partition and
on 23 real-world networks ranging from networks with tens of nodes to networks with several
tens of millions of edges. The algorithm is shown to be comparable to the current
state-of-the-art community detection algorithms and superior to all previous label
propagation algorithms, with comparable time complexity. In particular, the algorithm
exhibits almost linear time complexity (in the number of edges of the network).

The rest of the article is structured as follows. Section II gives a formal introduction to
label propagation and reviews subsequent advances relevant for this research. Section III
presents the diffusion and propagation algorithm and discusses the main rationale behind it.
Empirical evaluation with discussion is given in section IV, and the conclusion in section V.



II. LABEL PROPAGATION AND ADVANCES 

Let the network be represented by an undirected graph $G(N, E)$, with $N$ being the set of
nodes of the graph and $E$ being the set of edges. Furthermore, let $c_n$ be the community
(label) of node $n$, $n \in N$, and $\mathcal{N}(n)$ the set of its neighbors.

The basic label propagation algorithm (LPA) [18] exploits the following simple procedure. At
first, each node is labeled with a unique label, $c_n = l_n$. Then, at each iteration, each
node is assigned the label shared by most of its neighbors (i.e., the maximal label),

$$c_n = \operatorname*{argmax}_{l} \left|\mathcal{N}^l(n)\right|, \tag{1}$$

where $\mathcal{N}^l(n)$ is the set of neighbors of $n$ that share label $l$ (in the case of
ties, one maximal label is chosen at random). Because there are many edges within the
communities, relative to the number of edges between the communities, nodes in a community
will adopt the same label after a few iterations. The algorithm converges when none of the
labels change anymore (i.e., equilibrium is reached), and nodes sharing the same label are
classified into the same community.
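
The procedure around Eq. (1) can be illustrated with a minimal Python sketch. It is not the reference implementation of [18]: the function name `label_propagation`, the adjacency format and the parameters `max_iter` and `seed` are hypothetical, labels are updated in place, and a node whose current label is already maximal is left unchanged (an assumption that makes the "no label changes" stopping test well defined).

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Basic LPA on an undirected graph given as {node: set_of_neighbors}."""
    rng = random.Random(seed)
    labels = {n: n for n in adj}  # c_n = l_n: every node starts with a unique label
    order = list(adj)
    for _ in range(max_iter):
        rng.shuffle(order)
        changed = False
        for n in order:
            if not adj[n]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[i] for i in adj[n])
            top = max(counts.values())
            maximal = [l for l, c in counts.items() if c == top]
            # Assumption: a node whose label is already maximal is left alone.
            if labels[n] not in maximal:
                labels[n] = rng.choice(maximal)  # ties are broken at random
                changed = True
        if not changed:
            break  # equilibrium: no label changed during a full sweep
    return labels
```

Nodes that end up with the same value in the returned dictionary form one community.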

The main advantage of label propagation is its near linear time complexity: the algorithm
commonly converges in fewer than 10 iterations (on networks of moderate size). Raghavan
et al. [18] observed that after 5 iterations 95% of nodes already obtain their "right" label.
Their observation can be further generalized: the proportion of nodes that change their label
in the first four iterations roughly follows the sequence 90%, 30%, 10%, 5%. However, due to
the algorithm's simplicity, the accuracy of the identified communities is often not
state-of-the-art (section IV).

Leung et al. [19] have noticed that the algorithm, applied to large web networks, often
produces a single large community, occupying more than half of the nodes of the network.
Thus, they have proposed a label hop attenuation technique to prevent a label from spreading
too far from its origin. Each label $l_n$ has an associated score $s_n$ (initially set to 1)
that decreases after each propagation (Eq. (1)). Hence,

$$s_n = \left(\max_{i \in \mathcal{N}^{c_n}(n)} s_i\right) - \delta, \tag{2}$$

with $\delta$ being the attenuation ratio. When $s_n$ reaches 0, the label can no longer
propagate onward (see Eq. (3)), which successfully eliminates the formation of a single major
community [19].

Leung et al. [19] have also shown that hop attenuation has to be coupled with node preference
$f_n$ (i.e., node propagation strength) in order to achieve superior performance. The label
propagation updating rule (Eq. (1)) is thus reformulated into

$$c_n = \operatorname*{argmax}_{l} \sum_{i \in \mathcal{N}^l(n)} f_i^{\alpha} s_i w_{ni}, \tag{3}$$

where $w_{ni}$ is the edge weight (equal to 1 for unweighted graphs) and $\alpha$ is a
parameter of the algorithm. They have experimented with preference equal to the degree of the
node, $f_i = k_i$, and $\alpha = 0.1$; however, no general approach was reported.
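
A single node update under Eqs. (2) and (3) might look as follows in Python. This is a sketch under stated assumptions, not the implementation of [19]: the function name `attenuated_update` and the dictionary-based data layout are hypothetical, ties go to the smallest label for determinism, the score is refreshed only when the label actually changes, and the score is clamped at 0.

```python
from collections import defaultdict


def attenuated_update(n, adj, labels, score, pref, delta=0.1, alpha=0.1):
    """Update node n with Eq. (3), then refresh its score with Eq. (2).

    adj maps node -> {neighbor: edge_weight}; labels, score and pref are dicts.
    """
    strength = defaultdict(float)
    for i, w in adj[n].items():
        # Contribution f_i^alpha * s_i * w_ni of neighbor i to its own label.
        strength[labels[i]] += (pref[i] ** alpha) * score[i] * w
    if not strength:
        return labels[n]  # no neighbors, nothing to propagate
    # Assumption: ties go to the smallest label to keep the sketch deterministic.
    new_label = max(sorted(strength), key=strength.get)
    if new_label != labels[n]:
        # Assumption: the score is refreshed only when the label changes.
        carriers = [score[i] for i in adj[n] if labels[i] == new_label]
        score[n] = max(0.0, max(carriers) - delta)
        labels[n] = new_label
    return new_label
```

Because a neighbor with score 0 contributes nothing to the sum, a fully attenuated label cannot spread any further.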

Label hop attenuation in Eq. (2) can be rewritten into an equivalent form that allows
altering $\delta$ during the course of the algorithm [19]. One keeps the label distance from
the origin $d_n$ (initially set to 0), which is updated after each propagation. Hence,

$$d_n = \left(\min_{i \in \mathcal{N}^{c_n}(n)} d_i\right) + 1, \tag{4}$$

and the score $s_n$ is then

$$s_n = 1 - \delta d_n. \tag{5}$$
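
The equivalence of the score form (Eq. (2)) and the distance form (Eqs. (4) and (5)) can be checked with a short induction. The sketch below assumes a fixed $\delta \ge 0$, ignores clamping of the score at 0, and assumes that every neighbor $i$ carrying the label already satisfies $s_i = 1 - \delta d_i$ (the origin has $d = 0$ and $s = 1$).

```latex
\begin{align*}
s_n &= \Bigl(\max_{i \in \mathcal{N}^{c_n}(n)} s_i\Bigr) - \delta
     = \Bigl(\max_{i \in \mathcal{N}^{c_n}(n)} (1 - \delta d_i)\Bigr) - \delta \\
    &= 1 - \delta \Bigl(\min_{i \in \mathcal{N}^{c_n}(n)} d_i\Bigr) - \delta
     = 1 - \delta \Bigl(\bigl(\min_{i \in \mathcal{N}^{c_n}(n)} d_i\bigr) + 1\Bigr)
     = 1 - \delta d_n .
\end{align*}
```

Storing $d_n$ instead of $s_n$ is what allows $\delta$ to change between iterations: the score is simply recomputed from the current $\delta$ whenever it is needed.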



Raghavan et al. [18] have already shown that the updating rule of label propagation
(Eq. (1)), or its refinements (Eq. (3)), might prevent the algorithm from converging. Imagine
a bipartite network with two sets of nodes, i.e., red and blue nodes. Let, at some iteration
of the algorithm, all red nodes share label $l_r$ and all blue nodes share label $l_b$. Due to
the bipartite structure of the network, at the next iteration all red nodes will adopt label
$l_b$ and all blue nodes label $l_r$. At the following iteration, all nodes will recover
their original labels, so the algorithm fails to converge.
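
The oscillation can be reproduced on a small complete bipartite graph. The following Python sketch is illustrative only: the helper `synchronous_step`, the node names and the label strings are hypothetical, and ties (which do not occur in this example) would go to the smallest label.

```python
from collections import Counter


def synchronous_step(adj, labels):
    """Every node adopts the most frequent label among its neighbors,
    using only the labels from the previous iteration."""
    new_labels = {}
    for n, nbrs in adj.items():
        counts = Counter(labels[i] for i in nbrs)
        top = max(counts.values())
        new_labels[n] = min(l for l, c in counts.items() if c == top)
    return new_labels


# Complete bipartite graph with two red and three blue nodes.
red = ["r1", "r2"]
blue = ["b1", "b2", "b3"]
adj = {r: set(blue) for r in red}
adj.update({b: set(red) for b in blue})

labels0 = {**{r: "l_r" for r in red}, **{b: "l_b" for b in blue}}
labels1 = synchronous_step(adj, labels0)  # red and blue swap labels
labels2 = synchronous_step(adj, labels1)  # labels swap back
```

Here `labels1` carries the two labels swapped and `labels2` equals the starting assignment, which is the two-cycle described above.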

The problem can be avoided with asynchronous updating [18]. Nodes are no longer updated all
together, but sequentially, in random order. Thus, when a node's label is updated, the
(possibly) already updated labels of its neighbors are considered (in contrast to synchronous
updating, which considers only labels from the previous iteration). It should be noted that
asynchronous updating can even increase the performance of the algorithm [19].

Furthermore, when a node has equally strong connections with two or more communities, its
label would, in general, constantly change [18, 19]. The problem is particularly apparent in
author collaboration (co-authorship) networks, where a single author often collaborates with
different research communities. On the collaboration network of network scientists [9], the
basic label propagation algorithm fails to converge, as there are up to 10% of nodes that
would change their label even after 10000 iterations; results suggest that at least 20% of
nodes, i.e., over 300 scientists, collaborate with different research communities [28].

Leung et al. [19] suggested including the concerned label itself in the maximal label
consideration (and not merely neighbors' labels); however, we use a slightly modified
version [18]. When there are multiple maximal labels among neighbors, and one of them equals
the concerned label, the node retains its label. The main difference here is that the modified
version considers the concerned label only when there exist multiple maximal labels among
neighbors. On the discussed collaboration network, such an algorithm converges in around 4
iterations.

Never converging nodes can also be regarded as a clear signature of overlapping
communities [1], where nodes can belong to multiple communities. An extension of label
propagation to detect overlapping communities was just recently proposed by Gregory [29] (and
previously discussed in [18, 19]). However, for simplicity, we investigate only basic
(no-overlap) versions of the label propagation algorithm.

Another important issue of label propagation is the stability of the identified community
structure [19], especially in large networks. For a more detailed discussion see [12].

Label propagation with asynchronous updating accesses the nodes in a random order. Nodes are
then shuffled after each iteration, mainly to address the problems discussed above. Although
this subsequent reshuffling does not increase the algorithm's complexity, it does increase its
computational time. Nevertheless, the results in Fig. 2 show that LPA without subsequent
reshuffling of nodes (LPAS) performs only slightly worse than the basic LPA. Thus, all the
approaches presented in the following section use asynchronous updating with a single
(initial) shuffling of nodes.

FIG. 2. (Color online) Comparison of node access strategies for label propagation on two sets
of benchmark networks with planted partition [27] (the results are averages over 100
realizations). Panel (a) shows the Lancichinetti et al. benchmark with 500 nodes and
communities of up to 50 nodes; panel (b) shows the Lancichinetti et al. benchmark with 1000
nodes and communities of up to 100 nodes. LPA denotes the basic label propagation algorithm
and LPAS denotes LPA without (subsequent) reshuffling of nodes.



III. DIFFUSION AND PROPAGATION ALGORITHM



This section presents the diffusion and propagation algorithm, which combines several
approaches that are also introduced in this section. We thus give here a brief overview of
these.

First, we further analyze label hop attenuation for LPA (section II) and propose different
dynamic hop attenuation strategies in section III A. Next, we consider various approaches for
node propagation preference (section II). By estimating node preference by means of the
diffusion over the network, we derive two algorithms that result in two unique strategies of
community formation, namely, defensive preservation and offensive expansion of communities.
The algorithms are denoted defensive and offensive diffusion and attenuation LPA (DDALPA and
ODALPA) and are presented in section III B.

The DALPA algorithms are combined into the basic diffusion and propagation algorithm (BDPA),
preserving the advantages of both the defensive and the offensive approach (section III C).
BDPA already achieves superior results on networks of moderate size (section IV); for use
with larger networks, the algorithm is further enhanced with core extraction and whisker
identification. The improved algorithm is denoted the (general) diffusion and propagation
algorithm (DPA) and is presented in section III C.



A. Dynamic hop attenuation 

Hop attenuation has proven to be a reliable technique for preventing the emergence of a major
community occupying most of the nodes of the network [19]. It is, however, not evident what
the value of the attenuation ratio $\delta$ should be (Eq. (2)). Leung et al. [19] have
experimented with values around 0.10 and obtained good results; still, their experimental
setting was rather limited. Furthermore, our preliminary empirical analysis suggests that
there is no (simple) universal value for $\delta$ applicable to all different types of
networks (results are omitted).






Leung et al. [19] have also observed that large values of $\delta$ may prevent the natural
growth of communities and have proposed a dynamic strategy that decreases $\delta$ from 0.50
towards 0. In the early iterations of the algorithm, large values of $\delta$ prevent a
single label from rapidly occupying a large set of nodes and ensure the emergence of a number
of strong community cores. The value of $\delta$ is then decreased, to gradually relax the
restriction and to allow formation of the actual communities depicted in the network
topology. Results on real-world networks show that such a strategy performs very well on
larger networks (section IV); still, the results can be further improved. Empirical
evaluation in section IV also shows that the strategy is too aggressive for smaller networks,
where it is commonly outperformed even by basic LPA.

We propose different dynamic hop attenuation strate- 
gies, based on the hypothesis that hop attenuation should 
only be employed, when a community, or a set of commu- 
nities, is rapidly occupying a large portion of the network. 
Otherwise, the restriction should be (almost) completely 
relaxed, to allow label propagation to reach the equilib- 
rium unrestrained. Thus, the approach would retain the 
dynamics of label propagation and still prevent the emer- 
gence of a major community. 

We have considered several strategies for detecting the emergence of a large community, or a
set of large communities. Due to limited space, we limit the discussion to two. After each
iteration, the value of $\delta$ (initially set to 0) is updated according to one of the
following rules:

nodes: $\delta$ is set to the proportion of nodes that changed their label,

communities: $\delta$ is set to the proportion of communities (i.e., labels) that
disappeared.

Both strategies successfully address the problem of major community formation; however, the
detailed comparison is omitted. The algorithms proposed here all use the nodes strategy
because of its much finer granularity compared to the communities strategy: after 4
iterations the number of communities is, in general, already 20 times smaller than the number
of nodes (section II); thus, the estimate of $\delta$ is rather rough for the communities
strategy. For the empirical evaluation see section IV.
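
The two update rules for $\delta$ can be written down in a few lines of Python. The function names `delta_nodes` and `delta_communities` are hypothetical, and the denominator used for the communities strategy (the number of labels present before the iteration) is an assumption, since the text does not state it explicitly.

```python
def delta_nodes(old_labels, new_labels):
    """'nodes' strategy: fraction of nodes whose label changed in the last iteration."""
    changed = sum(1 for n in old_labels if old_labels[n] != new_labels[n])
    return changed / len(old_labels)


def delta_communities(old_labels, new_labels):
    """'communities' strategy: fraction of labels that disappeared in the last iteration.

    Assumption: the fraction is taken relative to the labels present before the iteration.
    """
    before = set(old_labels.values())
    after = set(new_labels.values())
    return len(before - after) / len(before)
```

Both functions take the label assignment before and after one full iteration as dictionaries keyed by node, and both expect a non-empty network.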



B. Defensive and offensive propagation 

Leung et al. [19] have shown that using node preference, to increase the propagation
strength (i.e., label spread) from certain nodes, can improve the performance of basic LPA.
We conducted several experiments using variations of different measures of node centrality
for node propagation preference (i.e., degree and eigenvector centrality [30, 31] and node
clustering coefficient [32]). The results are omitted; however, they clearly indicate that
none of these static measures is suitable for all different types of networks (i.e., general
networks).



We have also observed that good performance can be obtained by giving higher preference to
the core of each community (i.e., to its most central nodes). For instance, on Zachary's
karate club network [33], where three high-degree nodes reside in the core of the two
(natural) communities, degree and eigenvector centralities are superior. However, on the
Girvan and Newman [6] benchmark networks, where all the nodes have equal degree (on average),
these measures are rendered useless and are outperformed by the node clustering coefficient.
On the Lancichinetti et al. [27] benchmark networks, the best performance is, interestingly,
obtained by inverted degree or eigenvector centrality. These measures seem to counteract each
node's degree (low-degree nodes have high propagation strength, and vice versa); thus, the
propagation utilizes merely the connectedness among nodes, disregarding its strength.

Based (also) on the above observations, we have developed two algorithms that estimate node
preference by means of the diffusion over the network. During the course of the algorithms,
the diffusion is formulated using a random walker within each of the (current) communities of
the network. The rationale here is twofold: (1) to estimate the (label) propagation within
each of the (current) communities [34]; and (2) to derive an estimate of the core and border
of each (current) community (with the core being the most central nodes of the community and
the border being its edge nodes).

Let $p_n$ be the probability that a random walker, utilized on the community labeled with
$c_n$, visits node $n$. $p_n$ can be computed as

$$p_n = \sum_{i \in \mathcal{N}^{c_n}(n)} \frac{p_i}{k_i^{c_n}}, \tag{6}$$

where the sum goes over all the neighbors of $n$ within the community $c_n$, and $k_i^{c_n}$
is the intra-community degree of node $i$. The employed formulation is similar to algorithms
like PageRank [35] and HITS [36], and also to the basic eigenvector centrality measure.
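
One evaluation of the right-hand side of Eq. (6) can be sketched in Python as below. The function name `diffusion_value` and the data layout are hypothetical, and the sketch computes a single update from the current estimates `p` rather than iterating to a fixed point.

```python
def diffusion_value(n, adj, labels, p):
    """Right-hand side of Eq. (6) for node n, using the current estimates p.

    adj maps node -> set of neighbors (undirected); labels and p are dicts.
    """
    c = labels[n]
    total = 0.0
    for i in adj[n]:
        if labels[i] != c:
            continue  # only neighbors inside the community of n contribute
        # Intra-community degree of i; at least 1 because n itself is counted.
        k_in = sum(1 for j in adj[i] if labels[j] == c)
        total += p[i] / k_in
    return total
```

On a small path a-b-c in one community, with a fourth node d attached to b from another community, the central node b receives a larger value than the end nodes, which matches the intended reading of core versus border.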

Finally, we present the two algorithms mentioned above, namely, defensive and offensive
diffusion and attenuation LPA (DDALPA and ODALPA). The defensive algorithm applies preference
(i.e., propagation strength) to the core of each community, i.e., $f_n^{\alpha} = p_n$, and
the updating rule in Eq. (3) rewrites to

$$c_n = \operatorname*{argmax}_{l} \sum_{i \in \mathcal{N}^l(n)} p_i s_i w_{ni}. \tag{7}$$

On the other hand, the offensive version applies preference to the border of each community,
i.e., $f_n^{\alpha} = 1 - p_n$, and the updating rule becomes

$$c_n = \operatorname*{argmax}_{l} \sum_{i \in \mathcal{N}^l(n)} (1 - p_i) s_i w_{ni}. \tag{8}$$
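
The only difference between Eqs. (7) and (8) is the preference term, which a small Python sketch makes explicit. The function name `propagate`, the `offensive` flag and the tie-breaking by smallest label are assumptions made for illustration; pseudo-code of the actual algorithms is given in appendix B.

```python
from collections import defaultdict


def propagate(n, adj, labels, p, score, offensive=False):
    """Return the label maximizing Eq. (7) (defensive) or Eq. (8) (offensive) for node n.

    adj maps node -> {neighbor: edge_weight}; labels, p and score are dicts.
    """
    strength = defaultdict(float)
    for i, w in adj[n].items():
        pref = (1.0 - p[i]) if offensive else p[i]
        strength[labels[i]] += pref * score[i] * w
    if not strength:
        return labels[n]  # isolated node keeps its label
    # Assumption: ties go to the smallest label to keep the sketch deterministic.
    return max(sorted(strength), key=strength.get)
```

With one neighbor of high $p_i$ carrying one label and two neighbors of low $p_i$ carrying another, the defensive rule follows the single core-like neighbor, whereas the offensive rule follows the two border-like neighbors.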



Opposed to the algorithm of Leung et al. [19], the main novelty here is in considering the
(current) communities found by the algorithm to estimate the (current) state of the label
propagation process, and then to adequately alter the dynamics of the process.

To better estimate the border of each community, the offensive algorithm uses degrees $k_i$
(instead of intra-community degrees $k_i^{c_n}$) for the estimation of diffusion values $p_n$
(see Eq. (6)). The modification results in higher values of $1 - p_n$ for nodes with large
inter-community degrees (i.e., nodes that reside in the borders of communities) and thus
provides a more adequate formulation of the node propagation strength for the offensive
version (results are omitted).

When a node's label changes, the values $p_n$ should be re-estimated for each node in the
concerned node's previous or current community. However, this would likely render the
algorithm inapplicable to larger networks. Thus, we only update the value $p_n$ (according to
Eq. (6)) when the node $n$ changes its label (initially all $p_n$ are set to $1/|N|$).
Although the approach is only a rough approximation of an exact version, preliminary
empirical experiments reveal no significant gain from using the exact values for $p_n$.

Defensive and offensive label propagation algorithms result in two unique strategies of
community formation, namely, defensive preservation and offensive expansion of communities.
The defensive algorithm quickly establishes a larger number of strong community cores (in the
sense of Eq. (7)) and is able to defensively preserve them during the course of the
algorithm. This results in an immense ability to detect communities, even when they are only
weakly defined in the network topology. On the other hand, the offensive approach produces a
range of communities of various sizes, as commonly observed in real-world networks [3, 18].
Laying the pressure on the border of each community expands those that are strongly defined
in the network topology. This constitutes a more natural (offensive) struggle among the
communities and results in great accuracy of the communities revealed.

A comparison of the algorithms on two real-world networks is depicted in Fig. 3. The examples
show that defensive propagation prefers networks with a rather homogeneous distribution of
community sizes, and that offensive propagation favors networks with a more heterogeneous
(e.g., power-law) distribution. It should, however, be noted that both approaches can achieve
superior performance on both of the networks. Still, on average, the defensive approach
performs better on the social network football [6], whereas the offensive approach
outperforms the defensive one on the metabolic network elegans [37].

For an empirical analysis and further discussion of the algorithms see section IV, and for
pseudo-code of the algorithms and discussion of some of the implementation issues see
appendix B.

FIG. 3. (Color online) Comparison of defensive and offensive label propagation on two
real-world networks, i.e., the social network of American football matches at a U.S.
college [6] and the metabolic network of the nematode Caenorhabditis elegans [37]. The
revealed communities are shown with pentagonal nodes, and the sizes and color intensities
(shadings) of nodes are proportional to the sizes of communities. The networks comprise two
relatively different community structures with respect to the distribution of community
sizes, which is rather homogeneous in the case of football and (presumably) power-law in the
case of elegans.



C. Diffusion and propagation algorithm



Defensive and offensive label propagation (section III B) convey two unique strategies of
community formation. An obvious improvement would be to combine the strategies, thus
retaining the strong detection ability of the defensive approach and the high accuracy of the
offensive strategy. However, simply using the algorithms one after another does not attain
the desired properties. The reason is that any label propagation algorithm, when run until
convergence, finds a local optimum (i.e., local equilibrium) that is hard to escape from.

FIG. 4. (Color online) Diagram of the (general) diffusion and propagation algorithm (DPA),
with stages labeled Network, Core & whiskers, and Communities. The figure is merely a
schematic representation of the algorithm and does not correspond to the actual result on the
given network. The algorithm combines defensive and offensive label propagation in a
hierarchical manner (steps 1. and 2.), to extract the core of the network (red heptagon
communities) and to identify whisker communities (blue triangle and orange square
communities). Whiskers are retained as identified communities, while the algorithm is
recursively applied to the core of the (community) network. The recursion continues until all
of the nodes of the (current) network are classified into the same community (i.e., offensive
propagation in step 2. flood-fills), at which point the basic diffusion and propagation
algorithm (BDPA) is applied (steps 3. and 4.). For a more detailed discussion of the
algorithms see text.



Raghavan et al. [18] have already discussed the idea (however, in a different context) that
label propagation could be improved if one had a priori knowledge about community cores. Core
nodes could then be labeled with the same label, leaving all the other nodes labeled with a
unique label. During the course of the algorithm, the (uniquely labeled) nodes would tend to
adopt the label of their nearest attractor (i.e., community core) and thus join its
community. This would improve the algorithm's stability [18] and also the accuracy of the
identified communities (section IV).



The defensive and offensive label propagation algorithms are combined in the following
manner. First, the defensive strategy is applied, to produce initial estimates of the
communities and to accurately detect their cores. All border nodes of each community are then
relabeled (labeled with a unique label), so that approximately one half of the nodes retain
their original label. Last, the offensive strategy is applied, which refines the community
cores and also accurately detects their borders. Such a combined strategy preserves the
advantages of both the defensive and the offensive label propagation algorithms and is
denoted the basic diffusion and propagation algorithm (BDPA). A schematic representation of
the algorithm is depicted in Fig. 4 (steps 3. and 4.).

The core (and border) of each community is estimated by means of diffusion $p_n$
(section III B). As core nodes possess more intra-community edges than border nodes, this
results in higher values of $p_n$ for core nodes. Thus, within the algorithm, the node $n$ is
relabeled according to the following rule,

$$c_n = \begin{cases} c_n & \text{for } p_n > m_{c_n}, \quad \text{(9a)} \\ l_n & \text{for } p_n \le m_{c_n}, \quad \text{(9b)} \end{cases}$$

where $m_{c_n}$ is the median of values $p_n$ for nodes in community $c_n$, and $l_n$ is a
unique label. Thus, the core nodes retain their original labels, whereas all border nodes are
relabeled. Note that all nodes with $p_n$ equal to the median are also relabeled, to
adequately treat smaller communities, where most of the nodes share the same value of $p_n$.
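
The relabeling rule of Eq. (9) translates almost directly into Python. In the sketch below the function name `relabel_borders` is hypothetical, and fresh unique labels are represented as tuples `("unique", n)`, which is assumed not to collide with any existing community label.

```python
from collections import defaultdict
from statistics import median


def relabel_borders(labels, p):
    """Apply Eq. (9): core nodes keep their label, border nodes get a unique one."""
    members = defaultdict(list)
    for n, c in labels.items():
        members[c].append(n)
    new_labels = {}
    for c, nodes in members.items():
        m = median(p[n] for n in nodes)  # m_c: median diffusion value in community c
        for n in nodes:
            # Nodes exactly at the median are relabeled as well (Eq. (9b)).
            new_labels[n] = c if p[n] > m else ("unique", n)
    return new_labels
```

In BDPA this step sits between the defensive run, which supplies `labels` and `p`, and the offensive run, which starts from the returned assignment.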

Empirical evaluation shows that BDPA significantly outperforms basic LPA and also the
algorithm of Leung et al. [19] on smaller networks. However, when networks become larger, the
hop attenuation strategy of Leung et al. [19] produces much larger communities, with higher
values of modularity (on average).

Different authors have proposed approaches that detect communities in a hierarchical manner
(e.g., [10]). The algorithm is first applied to the original network and initial communities
are obtained. One then constructs the community network, where nodes represent communities
and edges are added between them when their nodes are connected in the original network. The
algorithm is then recursively applied to the community network and the process repeats. At
the end, the best communities found by the algorithm are reported (according to some
measure).
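
Constructing the community network from a label assignment can be sketched as follows. The function name `community_network` is hypothetical, and weighting each community edge by the number of original edges it aggregates is an assumption; the text only requires that an edge exists when the communities are connected.

```python
from collections import defaultdict


def community_network(adj, labels):
    """Collapse each community into one node.

    adj maps node -> set of neighbors (undirected, both directions stored).
    Returns {community: {neighboring_community: number_of_connecting_edges}}.
    """
    cadj = defaultdict(lambda: defaultdict(float))
    for n, nbrs in adj.items():
        row = cadj[labels[n]]  # also registers communities without outside edges
        for i in nbrs:
            if labels[i] != labels[n]:
                row[labels[i]] += 1.0
    return {c: dict(row) for c, row in cadj.items()}
```

The result uses the same weighted adjacency layout as the input expected by a weighted label propagation step, so the procedure can be applied again to the collapsed network.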

The idea was also proposed in the context of label propagation [19]; however, the authors did
not report any empirical results. We have analyzed the behavior of hierarchical label
propagation on real-world networks and also on benchmark networks with planted partition. The
analysis has shown that, on the second iteration (when the algorithm is first run on the
community network), the label propagation (already) produces one major community or even
flood-fills (all nodes are classified into the same community).

Although the analysis revealed undesirable behavior, we have observed that the major
community commonly coincides with the core of the network, whereas the other communities
correspond to whisker communities. Leskovec et al. [3] have extensively analyzed large social
and information networks and observed that (these) networks reveal a clear core-periphery
structure: most of the nodes are in the central core of the network, which does not have a
clear community structure, whereas the best communities reside in the periphery (i.e.,
whiskers), which is only weakly connected with the core. For further discussion see
appendix A.

Based on the above observations, we propose the following algorithm, denoted the (general)
diffusion and propagation algorithm (DPA); a schematic representation of the algorithm is
depicted in Fig. 4. First, defensive label propagation is applied to the original network
(step 1.), which produces a larger number of smaller communities that are used to construct
the corresponding community network. Second, offensive label propagation is used on the
constructed community network (step 2.), to extract the core of the network (i.e., its major
community) and to identify whisker communities (i.e., all other communities). The above
procedure is then recursively applied only to the core of the (community) network, while the
whisker communities are retained as identified communities. The recursion continues until the
offensive propagation in step 2. flood-fills (i.e., the extracted core contains all of the
nodes of the network analyzed), at which point BDPA is applied (steps 3. and 4.).

Empirical analysis on real-world networks shows that 
DPA outperforms all other label propagation algorithms 
(with comparable time complexity) and is comparable to 
current state-of-the-art community detection algorithms. 
Furthermore, the algorithm exhibits almost linear com- 
plexity (in the number of edges of the network) and scales 
even better than the basic LPA. It should also be noted 
that the application of the algorithm is not limited to 
networks that exhibit core-periphery structure. 

For a thorough empirical analysis and further discussion of both presented algorithms see
section IV, and for pseudo-code of the algorithms and discussion of some of the
implementation issues see appendix B.



IV. EVALUATION AND DISCUSSION 

This section presents the results of the empirical evaluation
of the proposed algorithms.

The algorithms were first compared on two classes of benchmark networks with
planted partition, namely, Girvan and Newman [6] and Lancichinetti et al. [27]
benchmark networks. For the latter, we also vary the size of the networks
(1000 and 5000 nodes) and the size of the communities (from 10 to 50 and from
20 to 100 nodes). Results are assessed in terms of normalized mutual
information (NMI) [52] and are shown in Fig. 5.
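For readers who want to reproduce this kind of comparison, the following is a minimal sketch of NMI between a planted and a detected partition. The article cites [52] for the definition and does not restate it here, so the normalization used below, 2 I(A;B) / (H(A) + H(B)), is an assumption, as are the function name `nmi` and the list-of-labels representation.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information of two partitions of the same nodes.

    labels_a[i] and labels_b[i] are the community labels of node i.
    Assumed normalization: 2 * I(A;B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (entropy_a + entropy_b)


planted = [0, 0, 1, 1]
print(nmi(planted, [1, 1, 0, 0]))  # same partition, different label names
print(nmi(planted, [0, 1, 0, 1]))  # partition independent of the planted one
```

The measure depends only on how nodes are grouped, not on the label names, so a relabelled copy of the planted partition scores the maximum.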

The analysis clearly shows the difference between defensive and offensive
propagation, especially on larger networks (Fig. 5(d,e)). The offensive
propagation (ODALPA) performs slightly better than the basic LPA, and can
still detect communities relatively accurately when LPA already performs
rather poorly (Fig. 5(d)). On the other hand, the defensive propagation
(DDALPA) does not detect communities as accurately as the other two
approaches (Fig. 5(d,e)); however, the algorithm still reveals the
communities even when they are only weakly defined (and the other two
approaches clearly fail). In other words, the defensive algorithm has high
recall, whereas the offensive approach achieves high precision.

Furthermore, BDPA (and DPA) outperforms all three aforementioned algorithms.
Note that the performance does not simply equal the upper hull of those for
DDALPA and ODALPA. The analysis also shows that core extraction (i.e. DPA)
does not improve the results on networks with thousands of nodes or fewer;
the slight improvement on the Girvan and Newman [6] benchmark results only
from hierarchical investigation, and not from core extraction. Nevertheless,
as shown below, the results can be significantly improved on larger networks.

Lancichinetti and Fortunato [53] have conducted a thorough empirical analysis
of more than 10 state-of-the-art community detection algorithms. To enable
the comparison, the benchmark networks in Fig. 5 were selected so that they
exactly coincide with those used in [53]. By comparing the results, we can
conclude that DPA does indeed perform at least as well as the best algorithms
analyzed in [53], namely, hierarchical modularity optimization of Blondel et
al. [10], the model selection approach of Rosvall and Bergstrom [16], the
spectral algorithm proposed by Donetti and Munoz [15] and the multiresolution
spin model of Ronhovde and Nussinov [20]. Moreover, on larger networks
(Fig. 5(d,e)), DPA obtains even better results than all of the algorithms
analyzed in [53]: for $\mu = 0.8$, none of the analyzed algorithms can obtain
NMI above 0.35, whereas the values for DPA are 0.651 and 0.541, respectively.

DPA (and BDPA) was further analyzed on 23 real-world networks (Table I),
ranging from networks with tens of nodes to networks with several tens of
millions of edges [54]. To conduct a general analysis, we have considered a
wide range of different types of real-world networks, in particular, social,
communication, citation,




[Fig. 5, panels (d) and (e): Lancichinetti et al. benchmark with n = 5000 and
community sizes C = [10, 50] and C = [20, 100], respectively. Curves for LPA,
DDALPA, ODALPA, BDPA and DPA are plotted against the mixing parameter $\mu$.]

FIG. 5. (Color online) Comparison of the proposed algorithms on two classes of benchmark networks with planted partition,
namely, Girvan and Newman [6] benchmark networks and four sets of Lancichinetti et al. [27] benchmark networks (the results
are averages over 100 realizations). Network sizes equal 128, 1000 and 5000 nodes, and communities comprise up to 100
nodes. (Gray) straight lines at $\mu = 0.5$ denote the point beyond which the communities are no longer defined in the strong
sense [15].



collaboration, web, Internet, biological and other networks (all networks
were treated as unweighted and undirected). Due to the large number of
networks considered, a detailed description is omitted.

DPA was compared with all other proposed label propagation algorithms (to our
knowledge) and a greedy modularity optimization approach (Table I). The
algorithms are as follows: LPA denotes basic label propagation [18] and LPAD
denotes LPA with decreasing hop attenuation and node preference equal to the
degree of the node [19] (section II). The modularity optimization version of
LPA is denoted LPAQ [11] and its refinement with multistep greedy merging
LPAM [12]. Furthermore, GMO denotes greedy modularity optimization proposed
by Clauset et al. [8].

For each algorithm, we report peak (maximal) modularities obtained on the
networks analyzed. Modularities for LPA, LPAD, BDPA and DPA were obtained by
running the algorithms from 2 to 100000 times on each network (depending on
the size of the network). On the other hand, peak modularities for LPAQ and
LPAM (and also GMO) were reported by Liu and Murata [12].
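Since modularity is the quantity reported throughout Table I, a minimal sketch of its computation may help. It assumes the standard definition for unweighted, undirected networks (which is how all networks are treated here), Q = sum over communities of l_c / m - (d_c / 2m)^2, with l_c the number of edges inside community c and d_c the total degree of its nodes. The function name `modularity` and the data layout are illustrative and not taken from the article.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Modularity Q of a partition of an unweighted, undirected network.

    edges  -- list of (u, v) pairs, each undirected edge listed once
    labels -- dict mapping each node to its community label
    """
    m = len(edges)
    internal = defaultdict(int)  # l_c: edges with both ends in community c
    degree = defaultdict(int)    # d_c: sum of node degrees in community c
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by one edge, split into the two triangles:
# Q = 2 * (3/7 - (7/14)^2) = 5/14.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, labels))
```

Reporting the peak value over many runs then amounts to taking the maximum of this function over the partitions returned by repeated executions of a (randomized) algorithm.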

The results show that DPA outperforms all other label propagation
algorithms, except LPAM on networks of medium size (i.e. elegans, emails, pgp
and codmat3). However, further analysis reveals that, on these networks, LPAM
already has considerable time complexity compared to DPA. It should also be
noted that the modularities obtained by LPAM on three of these networks
correspond to the highest modularity values ever reported in the literature.
Similarly, peak modularities obtained by DPA (and some others) on smaller
networks also
Network    Description                          Nodes  Edges  GMO  LPA  LPAD  LPAQ  LPAM  BDPA  DPA  # c.e.^c  time^c

karate     Zachary's karate club. [33]
dolphins   Lusseau's bottlenose dolphins. [38]
books      Co-purchased political books. [39]
football   American football league. [6]
elegans    Metabolic network C. elegans. [37]
jazz       Jazz musicians. [40]
netsci     Network scientists. [5]
yeast      Yeast protein interactions. [41]
emails     Emails within a university. [42]
power      Western U.S. power grid. [32]
blogs      Weblogs on politics. [43]
pgp        PGP web of trust. [44]
asi        Autonomous syst. of Internet. [25]
codmat3    Cond. Matt. archive 2003. [45]
codmat5    Cond. Matt. archive 2005. [45]
kdd3       KDD-Cup 2003 dataset. [46]
nec        nec web overlay map. [47]
epinions   Epinions web of trust. [48]
amazon3    Amazon co-purchasing 2003. [49]
ndedu      Webpages in nd.edu domain. [50]
google     Web graph of Google. [3]
nber       NBER patents citations. [51]
live       Live Journal friendships. [3]

a Reduced to the largest component of the original network.

b Obtained with slightly modified version of DPA (see caption).

c Average number of core extractions and computational times for DPA.



TABLE I. Peak (maximal) modularities Q for various label propagation algorithms and a greedy optimization of modularity.
The modularity for DPA for elegans was obtained with $\delta_{\max} = 1$ and for asi with $\delta_{\max} = 0$ (appendix B); otherwise the values are
0.420 and 0.588, respectively. Opaque solid values correspond to the approaches that have significant time complexity compared
to DPA.



equal the highest modularities ever published (to our knowledge, the
modularity for football even slightly exceeds the highest value ever
reported, i.e. 0.606, as opposed to 0.605). In summary, DPA obtains
significantly higher values of modularity than other comparable label
propagation approaches, especially on larger networks (with millions of nodes
and edges).

As already discussed in section III C, BDPA achieves
superior results on smaller networks, better than LPA,
LPAQ and LPAD (and GMO). However, the algorithm
is not appropriate for larger networks, where hierarchical
core extraction prevails (i.e. DPA).

We have also analyzed the number of core extractions (section III C) made by
DPA on these networks (Table I). Core extraction yields no gain on networks
with fewer than thousands of nodes or edges, where the average number of
extractions is commonly close to 0. However, when networks become larger, a
(single) core extraction produces a significant gain in modularity.
Interestingly, even on the network with several millions of nodes and several
tens of millions of edges (i.e. live), the number of extractions is still 1
(on average).

Next, we have thoroughly compared the time complexity of simple LPA and DPA
(and also LPAM [12]). On each iteration of the algorithms, each edge of the
network is visited (at most) twice. Thus the time complexity of a single
iteration equals $O(m)$, with m being the number of edges. The complexity for
DPA is even lower after the core has been extracted; however, for simplicity,
we consider each iteration to have complexity $O(m)$.
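The per-iteration cost can be seen in a minimal sketch of basic label propagation: in one sweep every node scans its own adjacency list once, so each undirected edge is touched twice. The sketch is illustrative only. The function name `label_propagation`, the reshuffling of the node order in every sweep, and the tie-breaking rule (keep the current label if it is among the most frequent, otherwise pick one at random) are assumptions, and the defensive and offensive variants of the article add node preference and hop attenuation on top of this scheme.

```python
import random
from collections import Counter


def label_propagation(adjacency, seed=0, max_sweeps=100):
    """Basic label propagation on an unweighted, undirected network.

    adjacency -- dict mapping each node to a list of its neighbours
    Returns (labels, sweeps), where sweeps counts the passes over all nodes,
    including the final pass in which no label changed.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # unique initial labels
    nodes = list(adjacency)
    for sweep in range(1, max_sweeps + 1):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            if not adjacency[node]:
                continue  # isolated node keeps its own label
            # One scan of the adjacency list: every edge is seen twice per sweep.
            counts = Counter(labels[nb] for nb in adjacency[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)
                changed = True
        if not changed:
            return labels, sweep
    return labels, max_sweeps


# Two disconnected triangles: each ends up with a single label of its own.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1],
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
labels, sweeps = label_propagation(adjacency)
print(labels, sweeps)
```

The returned sweep count is the quantity whose growth with network size determines the overall complexity, since the total cost is the cost of one sweep multiplied by the number of sweeps.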

Iterative algorithms (like label propagation) are commonly assessed only on
smaller networks, where the number of iterations can be bounded by a small
constant. In this context, both LPA and DPA exhibit near linear complexity,
$O(m)$. However, on networks with thousands or millions of nodes and edges,
this "constant" does increase: even for simple LPA, which is known for its
speed, the number of iterations notably increases on larger networks. We have
thus analyzed the total number of iterations made by the algorithms on
real-world networks (Table I). The results are shown in Fig. 6 (the number of
edges m is chosen to represent the size of the network). Note that the number
of iterations for DPA corresponds to the sum of the iterations made by all of
the algorithms run within (i.e. DDALPA, ODALPA and BDPA).

As already discussed, DPA (and LPA) scale much better than LPAM: the average
number of iterations on the network with tens of millions of edges is 147 and
78 for DPA and LPA, respectively, whereas LPAM already exceeds 300 iterations
on networks with tens of thousands of edges. Furthermore, the results also
show that DPA scales
FIG. 6. (Color online) Time complexity of different label propagation
algorithms estimated on real-world networks from Table I (results are
averages over 100 iterations), shown as the number of iterations against the
number of edges m. From top to bottom, straight lines correspond to
$0.83\,m^{0.51}$, $5.15\,m^{0.19}$ and $1.03\,m^{0.23}$, whereas the text
denotes the overall time complexity of the algorithms (LPAM, DPA and LPA,
respectively). On a network with a billion edges, the (projected) number of
iterations for DPA and LPA would equal 265 and 113, respectively.



even better than simple LPA (i.e. $O(m^{1.19})$, as opposed to
$O(m^{1.23})$); however, it is outperformed by LPA due to a larger constant.
Nevertheless, the analysis shows promising results for future analyses of
large complex networks.
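The overall exponents follow from multiplying the $O(m)$ cost of one iteration by an iteration count that itself grows as a power of m. A fit of the form a * m^b can be obtained by ordinary least squares on log-transformed data, as in the minimal sketch below. The function name `fit_power_law` and the input values are assumptions: the data are synthetic points generated from a known power law to check the procedure, not measurements from the article.

```python
from math import exp, log


def fit_power_law(sizes, iterations):
    """Least-squares fit of iterations ~ a * m**b on a log-log scale.

    sizes      -- network sizes m (number of edges), all positive
    iterations -- measured numbers of iterations, all positive
    Returns the pair (a, b).
    """
    xs = [log(m) for m in sizes]
    ys = [log(t) for t in iterations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = exp(mean_y - b * mean_x)
    return a, b


# Synthetic check: points generated from 5.15 * m**0.19 are recovered.
sizes = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
iterations = [5.15 * m ** 0.19 for m in sizes]
a, b = fit_power_law(sizes, iterations)
overall_exponent = 1 + b  # one iteration costs O(m), so the total is O(m**(1 + b))
print(a, b, overall_exponent)
```

With an iteration exponent b of about 0.19, the overall exponent 1 + b is about 1.19, which is how the almost linear complexity quoted above arises.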

In the context of analyzing large networks, it should be mentioned that by
far the fastest convergence is obtained by using the defensive propagation
algorithm DDALPA (section III B). On the largest of the networks (i.e. live),
the algorithm converges in only 25 iterations (three times faster than LPA);
still, the modularity of the revealed community structure is only 0.470.

Last, we have also studied the stability of DPA (and BDPA), and compared it
with simple LPA. The latter is known to find a large number of distinct
community structures in each network [18, 19, 23], and Tibely and Kertesz
[23] have also argued that these are relatively different from each other.
Indeed, on the zachary network LPA revealed 628 different community
structures (in 10000 iterations), whereas this number equals 159 and 124 for
BDPA and DPA, respectively. However, as the number of distinct communities
can be misleading, we have rather directly compared the identified community
structures.
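Counting distinct community structures over repeated runs requires a representation that ignores the arbitrary names of the labels. A minimal sketch, with the hypothetical helper names `canonical` and `count_distinct`, is to turn each result into a set of node sets and count the unique ones.

```python
def canonical(labels):
    """Label-independent form of a partition: a frozenset of node sets."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    return frozenset(frozenset(group) for group in groups.values())


def count_distinct(runs):
    """Number of different community structures among several runs."""
    return len({canonical(labels) for labels in runs})


runs = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "x", 1: "x", 2: "y", 3: "y"},  # same structure, other label names
    {0: "a", 1: "b", 2: "b", 3: "b"},  # a genuinely different structure
]
print(count_distinct(runs))  # 2
```

As noted above, this count alone says nothing about how different the structures are, which is why a pairwise similarity such as NMI is the more informative comparison.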

In Table II we show the mean pairwise NMI of (distinct) community structures
that were identified by the algorithms on a selected set of real-world
networks. DPA (and BDPA) proves to be more stable than LPA; moreover, the
identified community structures are relatively similar for all of the
algorithms considered (in most networks analyzed). Interestingly, the results
also seem to correlate with the revealed modularities in Table I: the clearer
the community structure of the network, the more stable the algorithms
appear. Nevertheless, as indicated by various authors before [18, 23], the
number of different community structures can be very high, especially in
larger networks (e.g., 1116 and 1330 for DPA applied to the football and jazz
networks, respectively).

TABLE II. Mean pairwise NMI of distinct community structures identified by
different label propagation algorithms in 10000 iterations (on a selected set
of networks from Table I).

(Cumulative) distributions of sizes of communities, revealed with the
proposed algorithms on three real-world networks, are shown in Fig. 7.



V. CONCLUSION 

The article proposes an advanced label propagation community detection
algorithm that combines two unique strategies of community formation. The
algorithm analyzes the network in a hierarchical manner that recursively
extracts the core of the network and identifies whisker communities. The
algorithm employs only local measures for community detection, and does not
require the number of communities to be specified beforehand. The proposal
was rigorously analyzed on benchmark networks with planted partition and on a
wide range of real-world networks, with up to several millions of nodes and
tens of millions of edges. The performance of the algorithm is comparable to
the current state-of-the-art community detection algorithms; moreover, the
algorithm exhibits almost linear time complexity (in the number of edges of
the network) and scales even better than the basic label propagation
algorithm. The proposal thus gives promising grounds for future analysis of
large complex networks.

The work also provides further understanding of the dynamics of label
propagation, in particular, how different propagation strategies can alter
the dynamics of the process and reveal community structures with unique
properties.



ACKNOWLEDGMENTS 

The authors wish to thank the (anonymous) reviewers for comments and
criticisms that helped improve the article. The work has been supported by
the Slovene Research Agency ARRS within the research program P2-0359.






(a) epinions (b) google (c) nber
FIG. 7. (Color online) (Cumulative) distributions of the community sizes for three real-world networks from Table I (for
the epinions network, the results were averaged over 10 runs). Note some particularly large communities revealed by DPA
in the case of the google and nber networks (with around $10^4$ and $10^6$ nodes, respectively). Interestingly, these coincide with low
conductance [55] communities reported in [3].



Appendix A: Core-periphery structure 

Leskovec et al. [3] have conducted an extensive analysis of large social and
information networks, and some others. They have observed that these networks
can be clearly divided into a central core and the remaining periphery (i.e.
core-periphery structure). The periphery consists of many small, well-defined
communities (in terms of conductance [55]) that are only weakly connected to
the rest of the network. When they are connected by a single edge, they are
called whiskers (or 1-whiskers). On the other hand, the core of the network
consists of larger communities that are well connected with each other, and
thus only loosely defined in the sense of communities. Their analysis has
thus revealed that the best communities (with respect to conductance) reside
in the periphery of these networks (i.e. whiskers), and have a characteristic
size of around 100 nodes. For further discussion see [3, 56].



Appendix B: Algorithms 

In this section we give the pseudo-code of all the algorithms proposed in the
article (Fig. 8, Fig. 9 and Fig. 10), and discuss some of the implementation
issues.



Due to the nature of label propagation, it may be that, when the algorithm
converges, two (disconnected) communities share the same label. This happens
when a node propagates its label in two directions, but is itself relabeled
in the later stages of the algorithm. Nevertheless, disconnected communities
can be detected at the end using a simple breadth-first search.
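A minimal sketch of this post-processing step is given below. The function name `split_disconnected` and the adjacency-dictionary representation are assumptions made for the example. Each search only follows edges between nodes that carry the same original label, so every connected piece of a label receives its own new label.

```python
from collections import deque


def split_disconnected(adjacency, labels):
    """Give every connected piece of a community its own label.

    adjacency -- dict mapping each node to a list of its neighbours
    labels    -- dict mapping each node to its (possibly shared) label
    Returns a dict with new integer labels, one per connected piece.
    """
    new_labels = {}
    next_label = 0
    for start in adjacency:
        if start in new_labels:
            continue
        new_labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first search restricted to one original label
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in new_labels and labels[nb] == labels[start]:
                    new_labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return new_labels


# Path 0-1-2-3-4 where label "a" is used on both sides of node 2.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: "a", 1: "a", 2: "b", 3: "a", 4: "a"}
print(split_disconnected(adjacency, labels))
# {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
```

Every node and edge is examined a constant number of times, so the step does not change the $O(m)$ cost of an iteration.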

Each run of BDPA or DPA (Fig. 9, Fig. 10) unfolds several sets of
communities, and the best are returned at the end (according to some measure
of goodness of communities). For the analysis in section IV, the algorithms
reported the community structure that obtained the highest modularity
(computed on the original network). Thus, the results might be affected by
modularity's resolution limit problem [57], or other limitations [55]; still,
this is not a direct artefact of the algorithms.

An additional note should be made for the offensive propagation algorithm
ODALPA (Fig. 8). When used on networks with several thousands of nodes or
fewer, diffusion values $p_n$ should only be updated (line 13) after the first
iteration, otherwise the algorithm might not converge. The reason is that,
during the first iteration, communities are still rather small (due to the
size of the network) and thus all of the nodes lie on the border of the
communities. Hence, updating diffusion values results in applying propagation
preference to all of the nodes.
Input: Graph G(N, E) with weights W
Output: Communities C (i.e. node labels)

δ ← 0
for n ∈ N do
    c_n ← l_n {Unique label.}
    d_n ← 0
    p_n ← 1/|N|
end for
shuffle(N)
while not converged do
    for n ∈ N do
        c_n ← argmax_l Σ_{i ∈ N^l(n)} p_i (1 − δ d_i) w_ni
        if c_n has changed then
            d_n ← (min_{i ∈ N^{c_n}(n)} d_i) + 1
            p_n ← Σ_{i ∈ N^{c_n}(n)} p_i / k_i^{c_n}
        end if
    end for
    δ ← proportion of labels changed
    if δ > δ_max then
        {δ_max is fixed to 1/2.}
        δ ← 0
    end if
end while
return C

FIG. 8. Defensive label propagation algorithm with (dynamic) hop attenuation
(DDALPA). In the offensive version (ODALPA), the node preference p_i is
replaced by 1 − p_i (line 10) and the degree k_i^{c_n} is replaced by k_i
(line 13).



Input: Graph G(N, E) with weights W
Output: Communities C (i.e. node labels)

C ← DDALPA(G, W)
for c ∈ C do
    {Retain community cores.}
    m_c ← median({p_n | n ∈ N ∧ c_n = c})
    for n ∈ N ∧ c_n = c ∧ p_n < m_c do
        c_n ← l_n {Unique label.}
        d_n ← 0
        p_n ← 0 {Maximal preference.}
    end for
end for
C ← ODALPA(G, W)
return C {Returns best communities.}

FIG. 9. Basic diffusion and propagation algorithm (BDPA). 
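The core-retention step in the middle of BDPA (Fig. 9) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `retain_cores`, the dictionary-based state, and the use of a `("reset", node)` tuple as the fresh unique label are all hypothetical choices.

```python
from statistics import median


def retain_cores(labels, distance, preference):
    """Reset nodes whose diffusion value lies below their community median.

    labels, distance, preference -- dicts keyed by node, as produced by a
    defensive propagation run (community label, hop distance, diffusion value).
    Returns updated copies; nodes at or above the median keep their community.
    """
    members = {}
    for node, label in labels.items():
        members.setdefault(label, []).append(node)

    new_labels = dict(labels)
    new_distance = dict(distance)
    new_preference = dict(preference)
    for label, nodes in members.items():
        threshold = median(preference[n] for n in nodes)
        for n in nodes:
            if preference[n] < threshold:
                new_labels[n] = ("reset", n)  # hypothetical fresh unique label
                new_distance[n] = 0
                new_preference[n] = 0.0       # maximal preference once 1 - p is used
    return new_labels, new_distance, new_preference


labels = {1: "x", 2: "x", 3: "x", 4: "x"}
distance = {1: 2, 2: 1, 3: 0, 4: 1}
preference = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
print(retain_cores(labels, distance, preference)[0])
# {1: ('reset', 1), 2: ('reset', 2), 3: 'x', 4: 'x'}
```

The nodes that survive form the community cores from which the subsequent offensive propagation expands.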



Input: Graph G(N, E) with weights W
Output: Communities C (i.e. node labels)

C ← DDALPA(G, W)
C_C ← ODALPA(G_C, W_C)
if C_C contains one community then
    C ← BDPA(G, W)
else
    {Recursion on core c in C_C.}
    C ← (C_C − {c}) ∪ DPA(G_C(c), W_C(c))
end if
return C {Returns best communities.}

FIG. 10. Diffusion and propagation algorithm (DPA).